ABSTRACT

With the rapid development of online social media, online shop- 
ping sites and cyber-physical systems, heterogeneous information 
networks have become increasingly popular and content-rich over 
time. In many cases, such networks contain multiple types of ob- 
jects and links, as well as different kinds of attributes. The clus- 
tering of these objects can provide useful insights in many applica- 
tions. However, the clustering of such networks can be challenging 
since (a) the attribute values of objects are often incomplete, which 
implies that an object may carry only partial attributes or even no 
attributes to correctly label itself; and (b) the links of different types 
may carry different kinds of semantic meanings, and it is a difficult 
task to determine the nature of their relative importance in helping 
the clustering for a given purpose. In this paper, we address these 
challenges by proposing a model-based clustering algorithm. We 
design a probabilistic model which clusters the objects of differ- 
ent types into a common hidden space, by using a user-specified 
set of attributes, as well as the links from different relations. The 
strengths of different types of links are automatically learned, and 
are determined by the given purpose of clustering. An iterative al- 
gorithm is designed for solving the clustering problem, in which 
the strengths of different types of links and the quality of cluster- 
ing results mutually enhance each other. Our experimental results 
on real and synthetic data sets demonstrate the effectiveness and 
efficiency of the algorithm. 

1. INTRODUCTION

With the rapid emergence of online social media, online shop- 
ping sites and cyber-physical systems, it has become possible to 
model many forms of interconnected networks as heterogeneous
information networks in which objects (i.e., nodes) are of different 
types, and links among objects correspond to different relations, 
denoting different interaction semantics. An object is usually asso- 
ciated with some attributes. For example, in the case of the YouTube 
social media network, the object types include videos, users, and 
comments; links between objects correspond to different relations, 
such as publish and like relations between users and videos, post 
relation between users and comments, friendship and subscribe re- 
lations between users, and so on; and attributes include user's lo- 
cation, video's clip length and number of views, comments, and so 
on. 

Such kinds of heterogeneous information networks are ubiqui- 
tous and the determination of their underlying clusters has many in- 
teresting applications. For example, clustering objects (customers, 
products, comments, etc.) in an online shopping network such as 
eBay is helpful for customer segmentation in product marketing; 
and clustering objects (people, groups, books, posts, etc.) in an on- 
line social network such as Facebook is helpful for voter segmenta- 
tion in political campaigns. Another example is the weather sensor 
network, where different types of sensors may carry different nu- 
merical attributes and be linked by k nearest neighbor relationships. 
The clustering process may reveal useful regional weather patterns. 

The clustering task brings two new challenges in such scenarios. 
First, an object may contain only partial or even no observations for 
a given attribute set that is critical to determining its cluster label.
That is, a pure attribute-based clustering algorithm cannot correctly 
detect these clusters. Second, although links have been frequently 
used in networks to detect clusters [8, 17, 1, 23] in recent research, 
we consider a much more challenging scenario in which the links 
are of different types and interpretations, each of which may have 
its own level of semantic importance in the clustering process. That 
is, a pure link-based clustering without any guidance from attribute 
specification could fail to meet user demands. 




Figure 1: A Motivating Example on Clustering Political Interests in
Social Information Networks

Fig. 1 shows a toy social information network extracted from
a political forum containing users, blogs written by users, books 
liked by users, and friendship between users. Now suppose we 
want to cluster users in the network according to their political in- 
terests, using the text attributes in user profiles, blogs and books, 
as well as the link information between objects. On one hand, 
since not all the users listed their political interests in their profiles, 
we cannot judge their political interests simply according to the 
text information contained in their profiles directly. On the other 
hand, without specifying the purpose of clustering, we cannot de- 
cide which types of links to use for the clustering: Shall we use the 
friendship links to detect the social communities, or the user-like- 
book links to detect the reading groups, or a mix of them? Obvi- 
ously, to solve such clustering tasks, we need to use both the incom- 
plete attribute information as well as the link information of differ- 
ent types with the awareness of their importance weights. In our 
example, in order to discover a user's political interests, we need to 
learn which link types are more important for our purpose of clus- 
tering, among the relationships between her and blogs, books, and 
her friends. 

Recently, there have been several studies [28, 18, 20, 25, 24, 16] 
showing that the combination of attribute and link information in a 
network can improve the clustering quality. However, none of these 
studies has addressed the two challenges simultaneously. Many 
of the studies [28, 20, 25] rely on a complete attribute space and 
the clustering result is considered as a trade-off between attribute- 
based measures and link-based measures. Moreover, none of the 
current studies has examined the issue that different types of links 
have different importance in determining a clustering with a certain 
purpose. 

In this study, we explore the interplay between different types of 
links and the specified attribute set in the clustering process, and design
a comprehensive and robust probabilistic clustering model for het- 
erogeneous information networks. First, we model each attribute 
attached with each object as a mixture model, with the mixing pro- 
portion as the soft clustering probability for each object. As it is a 
generative model, the incompleteness issue of the attributes is han- 
dled properly. Second, the importance of different types of links 
is modeled with different coefficients, which are determined by the
consistency of cluster membership vectors over all the linked
objects. In other words, the cluster membership information of
objects is propagated through the whole network, but different types of
links carry different capabilities in the propagation process. The 
goal is to determine the optimal levels of importance of the dif- 
ferent types of semantic links and the clustering results for objects 
simultaneously. An iterative method is proposed to learn the pa- 
rameters, where the clustering results and the importance weights 
for different link types are optimized alternately and mutually en- 
hance each other. 

The primary contributions of this paper are as follows. 

1. We propose a clustering problem for heterogeneous information 
networks with incomplete attributes across objects and different 
types of links, according to a user-specified attribute set that 
may be from different types. 

2. We design a novel probabilistic clustering model, which for the 
first time directly models the varying importance of different 
types of semantic links, for the above clustering problem. 

3. We propose an efficient algorithm to compute this model, where 
the clustering results and the strengths of different link types
mutually enhance each other.

4. We present experiments on both real and synthetic data sets to 
demonstrate the effectiveness and efficiency of the method. 



2. PROBLEM DEFINITION 

In this section, we introduce the notations, definitions and con- 
cepts relevant to the problem of clustering heterogeneous networks. 

2.1 The Data Structure 

A heterogeneous information network $G = (V, E, W)$ is modeled as a
directed graph, where each node $v \in V$ in the network corresponds
to an object (or an event), and each link $e \in E$ corresponds to a
relationship between the linked objects, with its weight denoted by
$w(e)$. Different from the traditional network definition, the objects
and links in heterogeneous networks are associated with explicit type
information to distinguish the semantic meanings; namely, we have a
mapping function from object to object type, $\tau : V \to \mathcal{A}$,
and a mapping function from link to link type,
$\phi : E \to \mathcal{R}$. $\mathcal{A}$ is the object type set, and
$\mathcal{R}$ is the link type set, or the relation set, which provides
linkage guidance between nodes. Notice that, if a relation exists from
type $A$ to type $B$, denoted as $A\,R\,B$, the inverse relation
$R^{-1}$ holds naturally for $B\,R^{-1}\,A$. In most cases, $R$ and its
inverse $R^{-1}$ are not equal, unless the two types are the same and
$R$ is symmetric.

Attributes are associated with objects, such as the location of a
user, the text description of a book, the text information of a blog,
and so on. In this setting, we consider attributes across all
different types of objects as a collection of attributes for the
network, denoted as $\mathcal{X} = \{X_1, \ldots, X_T\}$, in which we
are interested only in a subset for a certain clustering purpose. Each
object $v \in V$ contains a subset of the attributes, with observations
denoted as $v[X] = \{x_{v,1}, x_{v,2}, \ldots, x_{v,N_{X,v}}\}$, where
$N_{X,v}$ is the total number of observations of attribute $X$ attached
to object $v$. Notice that some attributes can be shared by different
types of objects, such as the text and the location attribute, while
some other attributes are unique to a certain type of object, such as
the clip time length for a video. We use $V_X$ to denote the object set
that contains attribute $X$.
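
As a rough illustration of this data structure, the sketch below stores
typed objects, typed weighted links, and per-object attribute
observations. The class name `HeteroNetwork`, its methods, and the toy
objects are hypothetical names chosen for this example and are not part
of the paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HeteroNetwork:
    """Toy container for G = (V, E, W) with typed objects and typed links."""
    # tau: object id -> object type
    object_type: dict = field(default_factory=dict)
    # each link is (source, target, link type, weight w(e))
    links: list = field(default_factory=list)
    # (object id, attribute name) -> list of observations v[X]
    observations: dict = field(default_factory=lambda: defaultdict(list))

    def add_object(self, v, obj_type):
        self.object_type[v] = obj_type

    def add_link(self, src, dst, link_type, weight=1.0):
        self.links.append((src, dst, link_type, weight))

    def add_observation(self, v, attribute, value):
        self.observations[(v, attribute)].append(value)

    def objects_with(self, attribute):
        """Return V_X: objects holding at least one observation of X."""
        return {v for (v, a), vals in self.observations.items()
                if a == attribute and vals}


net = HeteroNetwork()
net.add_object("u1", "user")
net.add_object("b1", "book")
net.add_link("u1", "b1", "like")
net.add_link("b1", "u1", "liked_by")  # the inverse relation R^{-1}
net.add_observation("b1", "text", "election")
net.add_observation("b1", "text", "policy")
```

Here the user `u1` has no text observations at all, which is exactly
the incomplete-attribute situation the model is meant to handle.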

2.2 The Clustering Problem 

In this paper, we study the clustering problem that maps every 
object in the network into a unified hidden space, i.e., a soft clus- 
tering, according to the user-specified subset of attributes in the 
network, with the help of links from different types. 

There are several new challenges for clustering objects in this 
new scenario. First, the attributes are usually incomplete for an 
object: the attributes specified by a user may be only partially or 
even not contained in an object type; and the values for these at- 
tributes could be missing even if the attribute type is contained in 
the object type. Moreover, the incompleteness of the data cannot be 
easily handled by interpolation: the observations for each attribute 
could be a set or a bag of values, and the neighbors for an object 
are from different types of objects, which may not be helpful for 
predicting the missing data. For example, it is impossible to get 
a user's blog via interpolating techniques. Therefore, none of the
existing clustering algorithms that are purely based on the attribute
space can solve the clustering problem in this scenario.

Second, with the awareness that links play a very important role 
to propagate the cluster information among objects, another chal- 
lenge is that different link types have different semantic mean- 
ings and therefore have different strengths in the process of passing 
cluster information around. In other words, while it is clear that the 
existence of links between nodes is indicative of clustering similar- 
ity, it is also important to understand that different link types may 
have a different level of importance in the clustering process. In 
the example of clustering political interests illustrated in Fig. 1, we 
expect a higher importance of the relation user-like-book than the 
relation friendship in deciding the cluster membership of a user.
Thus, we need to design a clustering model which can learn the 
importance of these link types automatically. This will enhance the 
clustering quality because it marginalizes the impact of low quality 
types of neighbors of an object during the clustering process. 

We present examples of clustering tasks in two concrete hetero- 
geneous information networks in the following. 

EXAMPLE 1. Bibliographic information network. A biblio- 
graphic network is a typical heterogeneous network, containing ob- 
jects from three types of entities, namely papers, publication venues 
(conferences or journals), and authors. Each paper has different 
link types to its authors and publication venue. Each paper is as- 
sociated with the text attribute as a bag of words. Each author and 
venue links to a set of papers, but contains no attributes (in our 
case). The application of a clustering process according to the text 
attribute in such a scenario can help detect research areas, and 
decide the research areas for authors, venues and papers. 




Figure 2: Illustration of Bibliographic Information Network 

Multiple types of objects and links in this network are illustrated 
in Fig. 2. For objects of different types, their cluster memberships 
may need to be determined by different kinds of information: for 
authors and venues, the only available information is from the pa- 
pers linked to them; for papers, both text attributes and links of 
different types are provided. Note that, even for papers that are 
associated with text attributes, using link information can further 
help the clustering quality when the observations of the text data
are very limited (e.g., using text merely from titles). Also, we may
expect that the neighbors of an author type play a more important 
role in deciding a paper's cluster compared with the neighbor of a 
venue type. This needs to be automatically learned in terms of the 
underlying relation strengths. 

EXAMPLE 2. Weather sensor network. Weather sensor net- 
works typically contain different kinds of sensors for detecting dif- 
ferent attributes, such as precipitation or temperature. Some sen- 
sors may have incorrect or no readings because of the inaccuracy 
or malfunctioning of the instruments. The links between sensors 
are generated according to their k nearest neighbors under geo- 
distances, in order to incorporate the importance of locality in 
weather patterns. The clustering of such sensors according to both 
precipitation and temperature attributes can be useful in determin- 
ing regional weather patterns. 

Fig. 3 illustrates a weather sensor network containing two types 
of sensors: temperature and precipitation. A sensor may sometimes 
register none or multiple observations. Although it is desirable to 
use the complete observations on both temperature and precipitation
to determine the weather pattern of a location, in reality a sensor
object may contain only partial attributes (e.g., temperature values
only for temperature sensors), and both the attribute and link
information are needed for correctly detecting the clusters. Still, 
which type of links plays a more important role needs to be deter- 
mined in the clustering process. 




Figure 3: Illustration of Weather Sensor Information Network 

Formally, given a network $G = (V, E, W)$, a specified subset of
its associated attributes $X \in \mathcal{X}$, the attribute
observations $\{v[X]\}$ for all objects, and the number of clusters
$K$, our goal is:

1. to learn a soft clustering for all the objects $v \in V$, denoted
by a membership probability matrix,
$\Theta_{|V| \times K} = (\theta_v)_{v \in V}$, where $\Theta(v, k)$
denotes the probability of object $v$ in cluster $k$,
$0 \le \Theta(v, k) \le 1$ and $\sum_{k=1}^{K} \Theta(v, k) = 1$, and
$\theta_v$ is the $K$-dimensional cluster membership vector for
object $v$, and

2. to learn the strengths (importance weights) of different link
types in determining the cluster memberships of the objects,
$\gamma_{|\mathcal{R}| \times 1}$, where $\gamma(r)$ is a real number
and stands for the importance weight for the link type
$r \in \mathcal{R}$.

Note that, in this paper we will not study the problem of how 
to determine the best number of clusters K, which belongs to the 
model selection problem and has been covered in a large number 
of studies by using various criteria [19, 12], such as AIC and BIC 
for probabilistic models. 

3. THE CLUSTERING MODEL 

We propose a novel probabilistic clustering model in this section 
and introduce the algorithm that optimizes the model in Section 4. 

3.1 Model Overview 

Given a network $G$, with the observations of its links and the
observations $\{v[X]\}$ for the specified attributes
$X \in \mathcal{X}$, a good clustering configuration $\Theta$, which
can be viewed as hidden cluster information for objects, should satisfy
two properties:

1. Given the clustering configuration, the observed attributes 
should be generated with a high probability. Specifically, we
model each attribute for each object as a separate mixture 
model, with each component representing a cluster. 

2. The clustering configuration should be highly consistent with 
the network structure. In other words, linked objects should 
have similar cluster membership probabilities, and larger 
strength of a link type requires more similarity between the 
linked objects of this type. 

Overall, we can define the likelihood of the observations of all
the attributes $X \in \mathcal{X}$ as well as the hidden continuous
cluster configuration $\Theta$, given the underlying network $G$, the
relation strength vector $\gamma$, and the cluster component parameter
$\beta$, which can be decomposed into two parts, the generative
probability of the observed attributes given $\Theta$ and the
probability of $\Theta$ given the network structure:

$$p(\{\{v[X]\}_{v \in V_X}\}_{X \in \mathcal{X}}, \Theta \mid G, \gamma, \beta) = \prod_{X \in \mathcal{X}} p(\{v[X]\}_{v \in V_X} \mid \Theta, \beta)\; p(\Theta \mid G, \gamma) \tag{1}$$

From a generative point of view, this model explains how observations
for attributes associated with objects are generated: first, a hidden
layer of variables $\Theta$ is generated according to the probability
$p(\Theta \mid G, \gamma)$, given the network structure $G$ and the
strength vector $\gamma$; second, the observed values of attributes
associated with each object are generated according to mixture models,
given the cluster membership of the object, as well as the cluster
component parameter $\beta$, with the probability
$\prod_{X \in \mathcal{X}} p(\{v[X]\}_{v \in V_X} \mid \Theta, \beta)$.

The goal is then to find the best parameters $\gamma$ and $\beta$, as
well as the best clustering configuration $\Theta$ that maximize the
likelihood. The detailed modeling of the two parts is introduced in the
following.

3.2 Modeling Attribute Generation 

Given a configuration $\Theta$ for the network $G$, namely, the
membership probability vector $\theta_v$ for each object $v$, the
attribute observations for each object $v$ are conditionally
independent of observations from other objects. Each attribute $X$
associated with each object $v$ is then assumed to follow the same
family of mixture models that share the same cluster components, with
the component mixing proportion as the cluster membership vector
$\theta_v$. For simplicity, we first assume that only one attribute $X$
is specified for the clustering purpose and then briefly discuss a
straightforward extension to the multi-attribute case.

3.2.1 Single Attribute

Let $X$ be the only attribute of interest in the network, and let
$v[X]$ be the observed values for object $v$, which may contain
multiple observations. It is natural to consider that the attribute
observation $v[X]$ for each object $v$ is generated from a mixture
model, where each component is a probabilistic model that stands for a
cluster, with the parameters to be learned, and component weights
denoted by $\theta_v$. Formally, the probability of all the
observations $\{v[X]\}_{v \in V_X}$ given the network configuration
$\Theta$ is modeled as:

$$p(\{v[X]\}_{v \in V_X} \mid \Theta, \beta) = \prod_{v \in V_X} \prod_{x \in v[X]} \sum_{k=1}^{K} \theta_{v,k}\, p(x \mid \beta_k) \tag{2}$$

where $K$ is the number of clusters, and $\beta_k$ is the parameter for
component $k$. In this paper, we consider two types of attributes,
one corresponding to text attributes with categorical distributions,
and the other numerical attributes with Gaussian distributions.

(1) Text attribute with categorical distribution: In this case,
objects in the network contain text attributes in the form of a term
list, from the vocabulary $l = 1$ to $m$. Each cluster $k$ has a
different term distribution following a categorical distribution, with
the parameter $\beta_k = (\beta_{k,1}, \ldots, \beta_{k,m})$, where
$\beta_{k,l}$ is the probability of term $l$ appearing in cluster $k$,
i.e., $X \mid k \sim \mathrm{discrete}(\beta_{k,1}, \ldots, \beta_{k,m})$.
Following the frequently used topic modeling method PLSA [11], each
term in the term list for an object $v$ is generated from the mixture
model, with each component as a categorical distribution over terms
described by $\beta_k$, and the component coefficient is $\theta_v$.
Formally, the probability of observing all the current attribute values
is:

$$p(\{v[X]\}_{v \in V_X} \mid \Theta, \beta) = \prod_{v \in V_X} \prod_{l=1}^{m} \Big( \sum_{k=1}^{K} \theta_{v,k}\, \beta_{k,l} \Big)^{c_{v,l}} \tag{3}$$

where $c_{v,l}$ denotes the count of term $l$ that object $v$ contains.
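
The logarithm of Eq. (3) is simple to evaluate numerically. The sketch
below is one possible NumPy implementation; the function name
`text_log_likelihood`, the array layout, and the toy numbers are
assumptions made for illustration only.

```python
import numpy as np


def text_log_likelihood(theta, beta, counts):
    """Log of Eq. (3).

    theta:  (n_objects, K) array, row v is the membership vector theta_v
    beta:   (K, m) array, row k is the term distribution beta_k
    counts: (n_objects, m) array of term counts c_{v,l}; an all-zero row
            is an object without any text observations
    """
    # mix[v, l] = sum_k theta_{v,k} * beta_{k,l}
    mix = theta @ beta
    # Terms with zero count contribute nothing, so replace their mixture
    # probability by 1 (log 1 = 0) to avoid evaluating 0 * log(0).
    safe_log = np.log(np.where(counts > 0, mix, 1.0))
    return float(np.sum(counts * safe_log))


theta = np.array([[1.0, 0.0], [0.5, 0.5], [0.3, 0.7]])
beta = np.array([[0.9, 0.1], [0.2, 0.8]])
counts = np.array([[2, 0], [1, 1], [0, 0]])  # third object has no text
ll = text_log_likelihood(theta, beta, counts)
```

The third object has no observations and therefore adds nothing to the
attribute likelihood; in the model its membership is constrained only
through its links.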

(2) Numerical attribute with Gaussian distribution: In this case,
objects in the network contain numerical observations in the form
of a value list, from the domain $\mathbb{R}$. The $k$th cluster is a
Gaussian distribution with parameters $\beta_k = (\mu_k, \sigma_k^2)$,
i.e., $X \mid k \sim \mathcal{N}(\mu_k, \sigma_k^2)$, where $\mu_k$ and
$\sigma_k$ are the mean and standard deviation of the normal
distribution for component $k$. Each observation in the observation
list for an object $v$ is generated from the Gaussian mixture model,
where each component is a Gaussian distribution with parameters
$\mu_k, \sigma_k^2$, and the component coefficient is $\theta_v$. The
probability density for all the observations for all objects is then:

$$p(\{v[X]\}_{v \in V_X} \mid \Theta, \beta) = \prod_{v \in V_X} \prod_{x \in v[X]} \sum_{k=1}^{K} \theta_{v,k}\, \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(x - \mu_k)^2}{2\sigma_k^2}} \tag{4}$$
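
As a minimal sketch of how the log of Eq. (4) could be evaluated, the
function below loops over objects and their (possibly empty)
observation lists. The name `gaussian_log_likelihood` and the toy
readings are assumptions for illustration, not the authors' code.

```python
import numpy as np


def gaussian_log_likelihood(theta, mu, sigma, observations):
    """Log of Eq. (4).

    theta:        (n_objects, K) membership vectors theta_v
    mu, sigma:    (K,) component means and standard deviations
    observations: list of lists; observations[v] holds the readings of
                  object v and may be empty
    """
    total = 0.0
    for v, readings in enumerate(observations):
        for x in readings:
            # Gaussian density of x under each of the K components.
            dens = (np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                    / np.sqrt(2 * np.pi * sigma ** 2))
            # Mixture density with mixing proportions theta_v.
            total += np.log(theta[v] @ dens)
    return float(total)


theta = np.array([[1.0, 0.0], [0.5, 0.5]])
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 1.0])
# Object 0 has one reading; object 1 (e.g. a failed sensor) has none.
ll = gaussian_log_likelihood(theta, mu, sigma, [[0.0], []])
```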

3.2.2 Multiple Attributes 

As in the weather sensor network example, we are interested in
multiple attributes, namely temperature and precipitation. Generally,
if multiple attributes in the network are specified by users, say
$X_1, \ldots, X_T$, the probability density of observed attribute
values $\{v[X_1]\}, \ldots, \{v[X_T]\}$ for a given clustering
configuration $\Theta$ is as follows, by assuming independence among
these attributes:

$$p(\{v[X_1]\}_{v \in V_{X_1}}, \ldots, \{v[X_T]\}_{v \in V_{X_T}} \mid \Theta, \beta_1, \ldots, \beta_T) = \prod_{t=1}^{T} p(\{v[X_t]\}_{v \in V_{X_t}} \mid \Theta, \beta_t) \tag{5}$$

3.3 Modeling Structural Consistency 

From the view of links, the more similar the two objects are in
terms of cluster membership, the more likely they are connected
by a link. In order to quantitatively measure the consistency of a
clustering result $\Theta$ with the network structure $G$, we define a
novel probability density function for observing $\Theta$.

We assume that linked objects are more likely to be in the same
cluster, if the link type is of importance in determining the
clustering process. That is, for two linked objects $v_i$ and $v_j$,
their membership probability vectors $\theta_i$ and $\theta_j$ should
be similar. Within the same type of links, the higher the link weight
$w(e)$, the more similar $\theta_i$ and $\theta_j$ should be. Further,
a certain link type may be of greater importance, and will influence
the similarity to a greater extent. The consistency of a configuration
$\Theta$ with the network $G$ is evaluated with the use of a composite
analysis with respect to all the links in the network in the form of a
probability density value. A more consistent configuration of $\Theta$
will yield a higher probability density value. In the following, we
first introduce how the consistency of two cluster membership vectors
is defined with respect to a single link, and then how this analysis
can be applied over all links in order to create a probability density
value as a function of $\Theta$.

For a link $e = (v_i, v_j) \in E$, with type $r = \phi(e) \in
\mathcal{R}$, we denote the importance of the link type to the
clustering process by a real number $\gamma(r)$. This is different from
the weight of the link $w(e)$, which is specified in the network as
input, whereas the value of $\gamma(r)$ is defined on link types and
needs to be learned. We denote the consistency function of two cluster
membership vectors $\theta_i$ and $\theta_j$ with link $e$ under
strength weights for each link type $\gamma$ by a feature function
$f(\theta_i, \theta_j, e, \gamma)$. Higher values of this function
imply greater consistency with the clustering results. In the
following, we list several desiderata for a good feature function:

1. The value of the feature function $f$ should increase with greater
similarity of $\theta_i$ and $\theta_j$.

2. The value of the feature function $f$ should decrease with greater
importance of the link $e$, either in terms of its specified weight
$w(e)$, or learned importance $\gamma(r)$. In other words, for the
larger strength of a particular link type, two linked nodes are
required to be more similar to claim the same level of consistency.

3. The feature function should not be symmetric between its first
two arguments $\theta_i$ and $\theta_j$, because the impact from node
$v_i$ to node $v_j$ could be different from that of $v_j$ to $v_i$.

The last criterion requires some further explanation. For example,
in a citation network, a paper $i$ may cite paper $j$, because $i$
feels that $j$ is relevant to itself, while the reverse may not be
necessarily true. In the experimental section, we will show that
asymmetric feature functions produce higher accuracy in link
prediction.

We then propose a cross entropy-based feature function, which
satisfies all of the desiderata listed above. For a link
$e = (v_i, v_j) \in E$, with relation type $r = \phi(e) \in
\mathcal{R}$, the feature function $f(\theta_i, \theta_j, e, \gamma)$
is defined as:

$$f(\theta_i, \theta_j, e, \gamma) = -\gamma(r)\, w(e)\, H(\theta_j, \theta_i) = \gamma(r)\, w(e) \sum_{k=1}^{K} \theta_{j,k} \log \theta_{i,k} \tag{6}$$

where $H(\theta_j, \theta_i) = -\sum_{k=1}^{K} \theta_{j,k} \log
\theta_{i,k}$ is the cross entropy from $\theta_j$ to $\theta_i$, which
evaluates the deviation of $v_j$ from $v_i$, in terms of the average
coding bits needed if using a coding schema based on the distribution
of $\theta_i$. For a fixed $\theta_j$ and a fixed value of
$\gamma(r)$, the value of $H(\theta_j, \theta_i)$ is minimal and
(therefore) $f$ is maximal, when the two vectors are identical. It is
also evident from Eq. (6) that the value of $f$ decreases with
increasing learned link type strength $\gamma(r)$ or input link weight
$w(e)$. We require $\gamma \ge 0$, in the sense that we do not consider
links that connect dissimilar objects. The value of $f$ so defined is
non-positive, with a larger value indicating a higher consistency of
the link.

Other distance functions such as KL-divergence could replace
the cross entropy in the feature function. However, as cross entropy
favors distributions that concentrate on one cluster
($H(\theta_j, \theta_i)$ achieves the lowest distance when
$\theta_j = \theta_i$ and $\theta_{i,k} = 1$ for some cluster $k$),
which agrees with our clustering purpose, we pick it over
KL-divergence.
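
The preference for concentrated distributions can be made explicit with
the standard decomposition of cross entropy into entropy plus
KL-divergence. The derivation below is a sketch; $H(\theta_j)$ (the
Shannon entropy of $\theta_j$) and $D_{\mathrm{KL}}$ are notation
introduced here for illustration and are not used elsewhere in the
text.

```latex
\begin{aligned}
H(\theta_j, \theta_i)
  &= -\sum_{k=1}^{K} \theta_{j,k} \log \theta_{i,k} \\
  &= \underbrace{-\sum_{k=1}^{K} \theta_{j,k} \log \theta_{j,k}}_{H(\theta_j) \,\ge\, 0}
   \;+\; \underbrace{\sum_{k=1}^{K} \theta_{j,k}
         \log \frac{\theta_{j,k}}{\theta_{i,k}}}_{D_{\mathrm{KL}}(\theta_j \,\|\, \theta_i) \,\ge\, 0}
\end{aligned}
```

The KL term vanishes exactly when $\theta_j = \theta_i$, and the
entropy term vanishes exactly when $\theta_j$ puts all of its mass on
one cluster. The cross entropy therefore reaches its lowest value, 0,
only for identical and fully concentrated vectors, whereas
KL-divergence alone is 0 for any identical pair, concentrated or not.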

Figure 4: Illustration of Feature Function

Fig. 4 illustrates a small example of a bibliographic network
containing 7 objects. For clarity, we only draw the out-links of two
objects corresponding to Paper 1 and Author 4. The weights of all
links are 1, and the given membership vector with respect to three
clusters is shown in the figure. Three link types are contained in the
network, corresponding to write(author, paper) with strength
weight $\gamma_1$, published_by(paper, venue) with weight $\gamma_2$,
and written_by(paper, author) with weight $\gamma_3$. From the example,
we can see that:

1. Objects 1 and 3 are more likely to belong to the first cluster,
Object 4 is a neutral object, and Object 5 is more likely to belong to
the third cluster. With Eq. (6), we get
$f(\langle 1, 3 \rangle) = -0.4701\gamma_3$;
$f(\langle 1, 4 \rangle) = -1.7174\gamma_3$; and
$f(\langle 1, 5 \rangle) = -2.3410\gamma_3$. In other words,
$f(\langle 1, 3 \rangle) > f(\langle 1, 4 \rangle) > f(\langle 1, 5 \rangle)$.
This satisfies the first desired criterion.

2. $f(\langle 1, 2 \rangle) = -0.4701\gamma_2$ and
$f(\langle 1, 3 \rangle) = -0.4701\gamma_3$. If $\gamma_2 < \gamma_3$
(or, the strength of link type published_by is smaller than that of
written_by), then $f(\langle 1, 2 \rangle) > f(\langle 1, 3 \rangle)$.
That is to say, in order to obtain the same value for two feature
functions defined on two different link types, the link type with
stronger strength requires even greater similarity for the membership
vectors. In other words, stronger link types are likely to exist only
between objects that are very similar to each other, and indicate a
better quality of the link type.

3. $f(\langle 1, 4 \rangle) = -1.7174\gamma_3$,
$f(\langle 4, 1 \rangle) = -1.0986\gamma_1$, and in general
$f(\langle 1, 4 \rangle) \ne f(\langle 4, 1 \rangle)$. Even if the two
links belong to the same type, i.e., $\gamma_3 = \gamma_1$, we still
have $f(\langle 1, 4 \rangle) < f(\langle 4, 1 \rangle)$. The intuitive
explanation is that it is less helpful for a neutral object to decide
an object's expertise than for an expert object to decide whether an
object is neutral. Therefore, the asymmetric criterion holds as well.
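
The values quoted in this example can be reproduced with a few lines of
NumPy. In the sketch below, the membership vectors are assumptions
chosen so that Eq. (6) yields the reported numbers (with all strengths
and weights set to 1); the function name `feature` is likewise
hypothetical.

```python
import numpy as np


def feature(theta_i, theta_j, strength=1.0, weight=1.0):
    """Eq. (6) for a link (v_i, v_j):
    strength * weight * sum_k theta_j[k] * log(theta_i[k])."""
    return strength * weight * float(np.dot(theta_j, np.log(theta_i)))


# Assumed membership vectors over three clusters.
theta = {
    1: np.array([5 / 6, 1 / 12, 1 / 12]),   # Paper 1, mostly cluster 1
    3: np.array([7 / 8, 1 / 16, 1 / 16]),   # Author 3, mostly cluster 1
    4: np.array([1 / 3, 1 / 3, 1 / 3]),     # Author 4, neutral
    5: np.array([1 / 16, 1 / 16, 7 / 8]),   # Author 5, mostly cluster 3
}

f13 = feature(theta[1], theta[3])  # about -0.4701
f14 = feature(theta[1], theta[4])  # about -1.7174
f15 = feature(theta[1], theta[5])  # about -2.3410
f41 = feature(theta[4], theta[1])  # about -1.0986, i.e. log(1/3)
```

Swapping the two arguments changes the value (`f14` versus `f41`),
which is the asymmetry required by the third desideratum.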

We then propose a log-linear model to model the probability of
$\Theta$ given the link type weights $\gamma$, where the probability of
one configuration $\Theta$ is defined as the exponential of the
summation of feature functions of all the links in $G$:

$$p(\Theta \mid G, \gamma) = \frac{1}{Z(\gamma)} \exp\Big\{ \sum_{e = (v_i, v_j) \in E} f(\theta_i, \theta_j, e, \gamma) \Big\} \tag{7}$$

where $\gamma$ is the strength weight vector for all link types,
$f(\theta_i, \theta_j, e, \gamma)$ is the feature function defined on
links of different types, and $Z(\gamma)$ is the partition function
that makes the distribution function integrate to 1:
$Z(\gamma) = \int_{\Theta} \exp\{ \sum_{e = (v_i, v_j) \in E}
f(\theta_i, \theta_j, e, \gamma) \}\, d\Theta$. The partition function
$Z(\gamma)$ is an integral over the space of all the configurations
$\Theta$, and it is a function of $\gamma$.
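
For a fixed $\gamma$, two configurations can be compared through the
exponent of Eq. (7) alone, because $Z(\gamma)$ is the same for both.
The sketch below computes only that unnormalized log-density; it makes
no attempt to evaluate $Z(\gamma)$. The function name, the link tuple
layout, and the toy data are assumptions for illustration.

```python
import numpy as np


def log_unnormalized_density(theta, links, gamma):
    """Sum of the feature function of Eq. (6) over all links, i.e.
    log p(Theta | G, gamma) + log Z(gamma).

    theta: (n_objects, K) array with strictly positive entries
    links: iterable of (i, j, link_type, weight) for links (v_i, v_j)
    gamma: dict mapping link type -> non-negative strength
    """
    total = 0.0
    for i, j, link_type, weight in links:
        cross_term = float(np.dot(theta[j], np.log(theta[i])))
        total += gamma[link_type] * weight * cross_term
    return total


links = [(0, 1, "friend", 1.0)]
gamma = {"friend": 1.0}
agree = np.array([[0.8, 0.2], [0.8, 0.2]])     # linked objects agree
disagree = np.array([[0.8, 0.2], [0.2, 0.8]])  # linked objects disagree
score_agree = log_unnormalized_density(agree, links, gamma)
score_disagree = log_unnormalized_density(disagree, links, gamma)
```

The configuration in which the two linked objects share the same
membership vector receives the higher score, as the model intends.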

3.4 The Unified Model 

The overall goal of the network clustering problem is to determine
the best clustering results $\Theta$, the link type strengths $\gamma$
and the cluster component parameters $\beta$ that maximize the
generative probability of attribute observations and the consistency
with the network structure, described by the likelihood function in
Eq. (1).

Further, we add a Gaussian prior to $\gamma$ as a regularization to
avoid overfitting, with mean $0$ and covariance matrix $\sigma^2 I$,
where $\sigma$ is the standard deviation of each element in $\gamma$,
and $I$ is the identity matrix. We set $\sigma = 0.1$ in our
experiments; a more complex strategy can be used to select $\sigma$
according to labeled clustering results, which will not be discussed
here. The new objective function is then:

$$g(\Theta, \beta, \gamma) = \sum_{X \in \mathcal{X}} \log p(\{v[X]\}_{v \in V_X} \mid \Theta, \beta) + \log p(\Theta \mid G, \gamma) - \frac{\|\gamma\|^2}{2\sigma^2} \tag{8}$$

In addition, we have the constraint that $\gamma \ge 0$, and some
constraints for $\beta$ that are dependent on the attribute
distribution type. Also, $p(\{v[X]\}_{v \in V_X} \mid \Theta, \beta)$
and $p(\Theta \mid G, \gamma)$ need to be replaced by the specific
formulas proposed above for concrete derivations.

4. THE CLUSTERING ALGORITHM 

This section presents a clustering algorithm that computes the 
proposed probabilistic clustering model. Intuitively, we begin with 
the assumption that all the types of links play an equally important 
role in the clustering process, then update the strength for each type 
according to the average consistency of links of that type with the 
current clustering results, and finally achieve a good clustering as 
well as a reasonable strength vector for link types. It is an iter- 
ative algorithm containing two steps in that clustering results and 
strengths of link types mutually enhance each other, which maximizes
the objective function of Eq. (8) alternately.

In the first step, we fix the link type weights $\gamma$ to the best
value $\gamma^*$, determined in the last iteration; then the problem
becomes that of determining the best clustering results $\Theta$ and
the attribute parameters $\beta$ for each cluster component. We refer
to this step as the cluster optimization step:
$[\Theta^*, \beta^*] = \arg\max_{\Theta, \beta} g(\Theta, \beta, \gamma^*)$.

In the second step, we fix the clustering configuration parameters 
= 0* and (3 = (3* , corresponding to the values determined in 
the last step, and use it to determine the best value of 7, which is 
consistent with current clustering results. We refer to this step as 
the link type strength learning step: 7* = arg max p(0* , (3* , 7) . 

The two steps are repeated until convergence is achieved. 

4.1 Cluster Optimization 

In the cluster optimization step, each object has the link information from different types of neighbors, where the strength of each type of link is given, as well as the possible attribute observations. The goal is to utilize both link and attribute information to get the best clustering for all the objects. Since $\gamma$ is fixed in this step, the partition function and the regularizer term become constants and can be discarded for optimization purposes. Therefore, we can construct a simplified objective function $g_1(\cdot, \cdot)$, which depends only on $\Theta$ and $\beta$:

$$g_1(\Theta, \beta) = \sum_{e = \langle v_i, v_j \rangle} f(\theta_i, \theta_j, e, \gamma) + \sum_{v \in V_X} \sum_{x \in v[X]} \log \sum_{k=1}^{K} \theta_{v,k}\, p(x \mid \beta_k) \qquad (9)$$

We derive an EM-based algorithm [9, 4] to maximize Eq. (9). In the E-step, the probability that each observation $x$ of each object $v$ and each attribute $X$ belongs to each cluster, usually called the hidden cluster label of the observation, is derived according to the current parameters $\Theta$ and $\beta$. In the M-step, the parameters $\Theta$ and $\beta$ are updated according to the new memberships of all the observations computed in the E-step. The iterative formulas for a single text attribute, a single Gaussian attribute, and two Gaussian attributes are provided below.

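1. Single categorical text attribute: Let $z_{v,l}$ denote the hidden cluster label for the $l$th term in the vocabulary for object $v$, $\Theta^{t-1}$ be the value of $\Theta$ at iteration $t-1$, and $\beta^{t-1}$ be the value of $\beta$ at iteration $t-1$. $1_{\{v \in V_X\}}$ is the indicator function, which is 1 if $v$ contains this attribute, and 0 otherwise. Then, we have:

$$\theta_{v,k}^{t} \propto \sum_{e = \langle v, u \rangle} \gamma(\phi(e))\, w(e)\, \theta_{u,k}^{t-1} + 1_{\{v \in V_X\}} \sum_{l} c_{v,l}\, p(z_{v,l} = k \mid \Theta^{t-1}, \beta^{t-1})$$

$$\beta_{k,l}^{t} \propto \sum_{v \in V_X} c_{v,l}\, p(z_{v,l} = k \mid \Theta^{t-1}, \beta^{t-1}) \qquad (10)$$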

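2. Single Gaussian numerical attribute: Let $z_{v,x}$ denote the hidden cluster label for the observation $x$ for object $v$, $\Theta^{t}$ be the value of $\Theta$ at iteration $t$, and $\mu_k^{t}$ and $\sigma_k^{t}$ be the values of the mean and standard deviation for the $k$th cluster at iteration $t$. $1_{\{v \in V_X\}}$ is the indicator function, which is 1 if $v$ contains this attribute, and 0 otherwise. Then, we have:

$$\theta_{v,k}^{t} \propto \sum_{e = \langle v, u \rangle} \gamma(\phi(e))\, w(e)\, \theta_{u,k}^{t-1} + 1_{\{v \in V_X\}} \sum_{x \in v[X]} p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})$$

$$\mu_k^{t} = \frac{\sum_{v \in V_X} \sum_{x \in v[X]} x\, p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})}{\sum_{v \in V_X} \sum_{x \in v[X]} p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})}$$

$$(\sigma_k^{t})^2 = \frac{\sum_{v \in V_X} \sum_{x \in v[X]} (x - \mu_k^{t})^2\, p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})}{\sum_{v \in V_X} \sum_{x \in v[X]} p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})} \qquad (11)$$

3. Two Gaussian numerical attributes: Let $X$, $Y$ be two attributes following Gaussian distributions, $z_{v,x}$ and $z_{v,y}$ denote the hidden cluster labels of the observation $x$ for attribute $X$ and the observation $y$ for attribute $Y$, respectively, for object $v$, $\Theta^{t}$ be the value of $\Theta$ at iteration $t$, and $\mu_{X,k}^{t}$, $\mu_{Y,k}^{t}$ and $\sigma_{X,k}^{t}$, $\sigma_{Y,k}^{t}$ be the values of the mean and standard deviation for the $k$th cluster of attributes $X$ and $Y$ at iteration $t$. $1_{\{v \in V_X\}}$ and $1_{\{v \in V_Y\}}$ are the indicator functions, which are 1 if $v$ contains $X$ or $Y$, respectively, and 0 otherwise. Then, we have:

$$\theta_{v,k}^{t} \propto \sum_{e = \langle v, u \rangle} \gamma(\phi(e))\, w(e)\, \theta_{u,k}^{t-1} + 1_{\{v \in V_X\}} \sum_{x \in v[X]} p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1}) + 1_{\{v \in V_Y\}} \sum_{y \in v[Y]} p(z_{v,y} = k \mid \Theta^{t-1}, \beta^{t-1})$$

$$\mu_{X,k}^{t} = \frac{\sum_{v \in V_X} \sum_{x \in v[X]} x\, p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})}{\sum_{v \in V_X} \sum_{x \in v[X]} p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})}$$

$$(\sigma_{X,k}^{t})^2 = \frac{\sum_{v \in V_X} \sum_{x \in v[X]} (x - \mu_{X,k}^{t})^2\, p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})}{\sum_{v \in V_X} \sum_{x \in v[X]} p(z_{v,x} = k \mid \Theta^{t-1}, \beta^{t-1})}$$

$$\mu_{Y,k}^{t} = \frac{\sum_{v \in V_Y} \sum_{y \in v[Y]} y\, p(z_{v,y} = k \mid \Theta^{t-1}, \beta^{t-1})}{\sum_{v \in V_Y} \sum_{y \in v[Y]} p(z_{v,y} = k \mid \Theta^{t-1}, \beta^{t-1})}$$

$$(\sigma_{Y,k}^{t})^2 = \frac{\sum_{v \in V_Y} \sum_{y \in v[Y]} (y - \mu_{Y,k}^{t})^2\, p(z_{v,y} = k \mid \Theta^{t-1}, \beta^{t-1})}{\sum_{v \in V_Y} \sum_{y \in v[Y]} p(z_{v,y} = k \mid \Theta^{t-1}, \beta^{t-1})} \qquad (12)$$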

A more detailed derivation of the EM algorithm for a single text attribute is provided in Appendix A; the derivation for single or multiple Gaussian numerical attributes is similar.
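The E-step and M-step for a single Gaussian attribute can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`gaussian_pdf`, `e_step`, `m_step`) and the list-based data layout are hypothetical, the link term of the membership update is omitted, and no guard against numerical underflow or empty clusters is included.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def e_step(theta_v, observations, means, stds):
    """Posterior p(z_{v,x} = k) for every observation x of one object v.

    theta_v is the current membership vector of v; means and stds hold the
    current per-cluster Gaussian parameters.
    """
    posteriors = []
    for x in observations:
        weights = [t * gaussian_pdf(x, m, s)
                   for t, m, s in zip(theta_v, means, stds)]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors

def m_step(observations, posteriors, num_clusters):
    """Posterior-weighted mean and standard deviation for each cluster.

    observations[i] lists the observations of object i and posteriors[i]
    the matching posterior vectors from e_step.
    """
    pairs = [(x, p)
             for obs, post in zip(observations, posteriors)
             for x, p in zip(obs, post)]
    means, stds = [], []
    for k in range(num_clusters):
        mass = sum(p[k] for _, p in pairs)
        mean = sum(x * p[k] for x, p in pairs) / mass
        var = sum((x - mean) ** 2 * p[k] for x, p in pairs) / mass
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds
```

The variance here uses the freshly updated mean, which is the usual EM convention for Gaussian mixtures.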

From the update rules, we can see that the membership probability of an object depends on its neighbors' memberships, the strengths of the link types, the weights of the links, and the attributes associated with it (if any). When an object contains no attributes in the specified set, or contains no observations for the specified attributes, its cluster membership is determined entirely by its linked objects: it is a weighted average of their cluster memberships, where each weight is the product of the link weight and the strength of the link type. When an object contains some observations of the specified attributes, its cluster membership is determined by both its neighbors and these observations for each possible attribute.
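The membership update described above can be sketched in a few lines. This is an illustrative sketch only: the function name `update_membership`, the tuple layout of the out-links, and the uniform fallback for an isolated object without attributes are assumptions made for the example, not details taken from the paper.

```python
def update_membership(out_links, theta_prev, gamma, attribute_votes=None):
    """One membership update for a single object v.

    out_links: list of (neighbor, link_type, weight) triples for v.
    theta_prev: dict mapping each object to its previous membership vector.
    gamma: dict mapping each link type to its strength.
    attribute_votes: optional vector whose k-th entry is the sum, over the
        observations of v, of the posterior probability of cluster k.
    """
    num_clusters = len(next(iter(theta_prev.values())))
    scores = [0.0] * num_clusters
    # Link part: neighbors vote with strength gamma[type] * link weight.
    for neighbor, link_type, weight in out_links:
        for k in range(num_clusters):
            scores[k] += gamma[link_type] * weight * theta_prev[neighbor][k]
    # Attribute part: only present when v carries observations.
    if attribute_votes is not None:
        for k in range(num_clusters):
            scores[k] += attribute_votes[k]
    total = sum(scores)
    if total == 0.0:
        # Assumed fallback: no links and no attributes gives a uniform vector.
        return [1.0 / num_clusters] * num_clusters
    return [s / total for s in scores]
```

For an object without observations, the result is exactly the weighted average of its neighbors' membership vectors.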

4.2 Link Type Strength Learning 

The link type strength learning step finds the best strength weight for each type of link, namely the one under which the current clustering result is generated with the highest probability. By doing so, low-quality link types that connect dissimilar objects are penalized and assigned low strength weights, while high-quality link types are assigned high strength weights.

Since the values of $\Theta$ and $\beta$ are fixed in this step, the only relevant parts of the objective function (for optimization purposes) are those that depend on $\gamma$: the structural consistency modeling part and the regularizer over $\gamma$. Therefore, we can construct the following simplified objective function $g_2(\cdot)$ as a function of $\gamma$:

$$g_2(\gamma) = \sum_{e = \langle v_i, v_j \rangle} f(\theta_i, \theta_j, e, \gamma) - \log Z(\gamma) - \frac{\|\gamma\|^2}{2\sigma^2} \qquad (13)$$

In addition, we have the linear constraints $\gamma \ge 0$.

Input: Network $G$, attributes $X_1, \ldots, X_T$, cluster number $K$;
Output: Cluster membership $\Theta$, link type weights $\gamma$, attribute component parameters $\beta_1, \ldots, \beta_T$;

Initialization for $\gamma^0$;
repeat
    % Step 1: Optimization of $\Theta^t$ given $\gamma^{t-1}$;
    Initialize $\Theta'^0$, $\beta'^0$;
    repeat
        1. for each object $v$, update $p(z_{v,x} = k \mid \Theta'^{s-1}, \beta'^{s-1})$;
        2. for each object $v$, update $\theta'^{s}_{v,k}$;
        3. for each cluster $k$, update the parameters for each attribute;
    until reaches precision requirement for $\Theta'^{s}$;
    $\Theta^t = \Theta'^{s}$;
    $\beta^t = \beta'^{s}$;
    % Step 2: Optimization of $\gamma^t$ given $\Theta^t$;
    $\gamma'^0 = \gamma^{t-1}$;
    repeat
        1. $\gamma'^{s} = \gamma'^{s-1} - [H g_2'(\gamma'^{s-1})]^{-1} \nabla g_2'(\gamma'^{s-1})$;
        2. $\forall r \in \mathcal{R}$, if $\gamma'(r)^{s} < 0$, set $\gamma'(r)^{s} = 0$;
    until reaches precision requirement for $\gamma'^{s}$;
    $\gamma^t = \gamma'^{s}$;
until reaches iteration number or precision requirement for $\gamma^t$;

Algorithm 1: The GenClus Algorithm.



However, $g_2$ is difficult to optimize directly, since the partition function $Z(\gamma)$ is an integral over the entire space of valid values of $\Theta$, which is intractable. Instead, we construct an alternative approximate objective function $g_2'$, which factorizes $\log p(\Theta \mid G)$ as the sum of $\log p(\theta_i \mid \theta_{-i}, G)$, namely the pseudo-log-likelihood, where $p(\theta_i \mid \theta_{-i}, G)$ is the conditional probability of $\theta_i$ given the clustering configurations of the remaining objects, which turns out to depend only on its neighbors. The intuition behind using the pseudo-log-likelihood to approximate the real log-likelihood is that, if the probability of generating the clustering configuration for each object conditional on its neighbors is high, the probability of generating the whole clustering configuration should also be high. In other words, if the local patches of a network are very consistent with the clustering results, the consistency over the whole network should also be high.

In particular, we choose each local patch of the network as an object and all its out-link neighbors. In this case, every link is considered exactly once, and the newly designed objective function $g_2'(\cdot)$ is as follows:

$$g_2'(\gamma) = \sum_{i=1}^{|V|} \Big( \sum_{e = \langle v_i, v_j \rangle} f(\theta_i, \theta_j, e, \gamma) - \log Z_i(\gamma) \Big) - \frac{\|\gamma\|^2}{2\sigma^2} \qquad (14)$$

where $\log Z_i(\gamma) = \log \int_{\theta_i} e^{\sum_{e = \langle v_i, v_j \rangle} f(\theta_i, \theta_j, e, \gamma)}\, d\theta_i$ is the local partition function for object $v_i$, with the linear constraints $\gamma \ge 0$.

As the joint distribution of $\Theta$ and the conditional distribution of $\theta_i$ given its out-link neighbors both belong to exponential families, both $g_2$ and $g_2'$ are concave functions of $\gamma$; the concavity of $g_2'$ is proved in Appendix B. Therefore, the maximum is achieved either at the global maximum point or on the boundary of the constraints. We use the Newton-Raphson method to solve the optimization problem. It requires the first and second derivatives of $g_2'(\gamma)$ with respect to $\gamma$, which are non-trivial in our case. We discuss their computation below.

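By re-examining $p(\theta_i \mid \{\theta_j\}_{\forall e = \langle v_i, v_j \rangle}, G)$, the conditional probability for each object $i$ given its out-link neighbors, we have:

$$p(\theta_i \mid \{\theta_j\}_{\forall e = \langle v_i, v_j \rangle}, G) \propto \prod_{k=1}^{K} \theta_{ik}^{\sum_{e = \langle v_i, v_j \rangle} \gamma(\phi(e))\, w(e)\, \theta_{jk}} \qquad (15)$$

It is easy to see that $p(\theta_i \mid \{\theta_j\}_{\forall e = \langle v_i, v_j \rangle}, G)$ is a Dirichlet distribution with parameters $\alpha_{ik} = \sum_{e = \langle v_i, v_j \rangle} \gamma(\phi(e))\, w(e)\, \theta_{jk} + 1$, for $k = 1$ to $K$. Therefore, the local partition function for each object $i$, $Z_i(\gamma)$, is the normalizing constant $B(\alpha_i)$ of the Dirichlet distribution, where $\alpha_i = (\alpha_{i1}, \ldots, \alpha_{iK})$ and $B(\alpha_i) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_{ik})}{\Gamma(\sum_{k=1}^{K} \alpha_{ik})}$. The first and second derivatives ($\nabla g_2'$ and $H g_2'$) can then be calculated, as each $Z_i$ is a function of Gamma functions.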
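To make the local partition function concrete, the following sketch computes the Dirichlet parameters of one object and the value of $\log B(\alpha_i)$ using the log-Gamma function from the Python standard library. The function names and the triple-based representation of out-links are hypothetical choices for this example.

```python
import math

def dirichlet_alpha(out_links, theta, gamma, num_clusters):
    """Dirichlet parameters of object i given its out-link neighbors.

    out_links: list of (neighbor, link_type, weight) triples for object i.
    theta: dict mapping each neighbor to its membership vector.
    gamma: dict mapping each link type to its strength.
    """
    alpha = [1.0] * num_clusters
    for neighbor, link_type, weight in out_links:
        for k in range(num_clusters):
            alpha[k] += gamma[link_type] * weight * theta[neighbor][k]
    return alpha

def log_local_partition(alpha):
    """log B(alpha): sum of log Gamma(alpha_k) minus log Gamma of the sum."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
```

Working with `math.lgamma` rather than `math.gamma` avoids overflow when an object has many or heavily weighted links.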

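The first derivative (or gradient) of $g_2'$ is expressed as:

$$\nabla g_2'(r) = \sum_{i=1}^{|V|} \Bigg( \sum_{\substack{e = \langle v_i, v_j \rangle \\ \phi(e) = r}} w(e) \sum_{k=1}^{K} \theta_{jk} \log \theta_{ik} - \bigg( \sum_{k=1}^{K} \psi(\alpha_{ik}) \sum_{\substack{e = \langle v_i, v_j \rangle \\ \phi(e) = r}} w(e)\, \theta_{jk} - \psi\Big( \sum_{k=1}^{K} \alpha_{ik} \Big) \sum_{\substack{e = \langle v_i, v_j \rangle \\ \phi(e) = r}} w(e) \bigg) \Bigg) - \frac{\gamma(r)}{\sigma^2} \qquad (16)$$

for every $r \in \mathcal{R}$, where $\psi(x)$ is the digamma function, that is, the first derivative of $\log \Gamma(x)$, namely $\psi(x) = \Gamma'(x) / \Gamma(x)$.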



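The second derivative (or Hessian matrix) of $g_2'$ can be expressed as:

$$H g_2'(r_1, r_2) = \sum_{i=1}^{|V|} \Bigg( - \sum_{k=1}^{K} \psi'(\alpha_{ik}) \sum_{\substack{e = \langle v_i, v_j \rangle \\ \phi(e) = r_1}} w(e)\, \theta_{jk} \sum_{\substack{e = \langle v_i, v_j \rangle \\ \phi(e) = r_2}} w(e)\, \theta_{jk} + \psi'\Big( \sum_{k=1}^{K} \alpha_{ik} \Big) \sum_{\substack{e = \langle v_i, v_j \rangle \\ \phi(e) = r_1}} w(e) \sum_{\substack{e = \langle v_i, v_j \rangle \\ \phi(e) = r_2}} w(e) \Bigg) - \frac{1}{\sigma^2} 1_{\{r_1 = r_2\}} \qquad (17)$$

for every pair of relations $r_1, r_2 \in \mathcal{R}$, where $\psi'(x)$ is the first derivative of $\psi(x)$, and $1_{\{r_1 = r_2\}}$ is the indicator function, with the value 1 if $r_1 = r_2$, and 0 otherwise.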

Then, we can use the Newton-Raphson method to determine the value of $\gamma$ that maximizes $g_2'$ with the following iterative steps:

1. $\gamma^{t+1} = \gamma^{t} - [H g_2'(\gamma^{t})]^{-1} \nabla g_2'(\gamma^{t})$;

2. $\forall r \in \mathcal{R}$, if $\gamma(r)^{t+1} < 0$, set $\gamma(r)^{t+1} = 0$.
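The two iterative steps can be sketched generically as a Newton-Raphson update followed by clipping at zero. The sketch below is not tied to $g_2'$: the gradient and Hessian are passed in as callables, and the helper `solve_linear` (plain Gaussian elimination) and all names are hypothetical, chosen only to keep the example self-contained. Clipping after each step mirrors the two steps above and is not a general-purpose constrained solver.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def projected_newton(grad, hess, gamma0, tol=1e-8, max_iter=50):
    """Newton-Raphson steps with negative components reset to zero."""
    gamma = list(gamma0)
    for _ in range(max_iter):
        # Step 1: Newton direction, i.e. the solution of H * step = gradient.
        step = solve_linear(hess(gamma), grad(gamma))
        # Step 2: apply the update and clip negative entries to zero.
        new_gamma = [max(0.0, g - s) for g, s in zip(gamma, step)]
        if max(abs(a - b) for a, b in zip(new_gamma, gamma)) < tol:
            return new_gamma
        gamma = new_gamma
    return gamma

# Toy concave objective f(g) = 4*g0 - g0**2 - g1 - 0.5*g1**2, used only
# to exercise the routine; its unconstrained maximum is at (2, -1).
def toy_grad(g):
    return [4.0 - 2.0 * g[0], -1.0 - g[1]]

def toy_hess(g):
    return [[-2.0, 0.0], [0.0, -1.0]]

result = projected_newton(toy_grad, toy_hess, [1.0, 1.0])
```

On the toy objective, the second coordinate would be negative at the unconstrained maximum, so the clipping step holds it at zero while the first coordinate reaches its optimum.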

4.3 Putting together: The GenClus Algorithm 

We integrate the two steps discussed above to construct a General Heterogeneous Network Clustering algorithm, GenClus, as shown in pseudocode in Algorithm 1.

The algorithm includes an outer iteration that updates $\Theta$ and $\gamma$ alternately, and two inner iterations that optimize $\Theta$ using the EM algorithm and optimize $\gamma$ using the Newton-Raphson method, respectively. For the initialization of $\gamma$ in the outer iteration, we use an all-1 vector. This means that all the link types in the network are initially considered equally important. For the initialization of $\Theta'^0$ in the inner iteration for optimizing $\Theta$, we can either (1) use random assignments, or (2) start with several random seeds, run the EM algorithm for a few steps for each random seed, and choose the one with the highest value of the objective function $g_1$ as the real starting point. The latter approach produces more stable results.
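The overall control flow can be sketched as a short driver that alternates between the two steps. This is a skeleton under stated assumptions rather than the authors' code: `cluster_step` and `strength_step` are hypothetical callables standing in for the EM and Newton-Raphson inner loops, and the stopping rule on the change in $\gamma$ is one plausible reading of the precision requirement.

```python
def genclus_outer_loop(link_types, cluster_step, strength_step,
                       max_outer_iters=10, tol=1e-4):
    """Alternate cluster optimization and link type strength learning.

    cluster_step(gamma) -> (theta, beta): inner EM loop with gamma fixed.
    strength_step(theta, beta, gamma) -> new gamma: inner Newton-Raphson
        loop with the clustering fixed, started from the previous gamma.
    """
    # All link types start out equally important (all-1 vector).
    gamma = {r: 1.0 for r in link_types}
    theta, beta = None, None
    for _ in range(max_outer_iters):
        theta, beta = cluster_step(gamma)
        new_gamma = strength_step(theta, beta, gamma)
        change = max(abs(new_gamma[r] - gamma[r]) for r in link_types)
        gamma = new_gamma
        if change < tol:
            break
    return theta, beta, gamma
```

Passing the inner loops as callables keeps the sketch independent of the attribute type, since only the EM step changes between text and Gaussian attributes.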

The time complexity of the EM algorithm in the first step is $O(t_1 (K d_1 |V| + K |E|))$, where $t_1$ is the number of iterations, $d_1$ is the average number of observations for each object, $K$ is the number of clusters, $|V|$ is the number of objects, and $|E|$ is the number of links in the network, which is linear in $|V|$ for sparse networks. The time complexity of the step that maximizes over $\gamma$ depends on the time for calculating the first derivative and the Hessian matrix of $g_2'(\gamma)$, and on the matrix inversion involved in the Newton-Raphson algorithm. This is $O(K |E| + t_2 |\mathcal{R}|^{2.376})$, where $K$ and $|E|$ have the same meaning as before, $t_2$ is the number of iterations, and $|\mathcal{R}|$ is the number of relations in the network. In all, the overall time complexity is $O(t (t_1 (K d_1 |V| + K |E|) + t_2 |\mathcal{R}|^{2.376}))$, where $t$ is the number of outer iterations. In other words, for each outer iteration, the time complexity is approximately linear in the number of objects in the network when the network is sparse. Therefore, the GenClus algorithm is quite scalable.

5. EXPERIMENTAL RESULTS 

In this section, we examine the effectiveness and efficiency of 
the clustering algorithm on several real and synthetic data sets. 

5.1 Data Sets 

Two real networks and one synthetic network are used in this 
study. From the DBLP Four-area data set [23] [10], we extracted 
two networks where the network structures are represented by dif- 
ferent subsets of entities and their corresponding links. This data 
set was extracted from 20 major conferences from the four areas 
corresponding to database, data mining, machine learning, and in- 
formation retrieval. It contains 14376 papers and 14475 authors, 
corresponding to publications before 2008. Labels were associated 
with a subset of the nodes, and specifically with 20 conferences, 
100 papers, and 4236 authors into 4 areas. Besides the real net- 
works, we also generated a synthetic weather sensor network. We 
describe these networks below: 

(a) DBLP Four-area AC Network. This network contains two types of objects, authors (A) and conferences (C), and three types of links depending upon publication behavior, namely publish_in(A, C) (abbr. as $\langle A, C \rangle$), published_by(C, A) (abbr. as $\langle C, A \rangle$), and coauthor(A, A) (abbr. as $\langle A, A \rangle$). The links are associated with a weight corresponding to the number of papers that an author has published in a conference, that a conference has published by an author, and that two authors have coauthored, respectively. The author nodes and conference nodes contain text corresponding to the titles of all the papers they have ever written or published.

(b) DBLP Four-area ACP Network. This network contains objects corresponding to authors (A), conferences (C) and papers (P), and four types of links depending upon the publication behavior, namely write(A, P) (abbr. as $\langle A, P \rangle$), written_by(P, A) (abbr. as $\langle P, A \rangle$), publish(C, P) (abbr. as $\langle C, P \rangle$), and published_by(P, C) (abbr. as $\langle P, C \rangle$). In this case, the links have binary weights, corresponding to the presence or absence of the link. Only papers contain text attributes, extracted from their titles.

(c) Weather Sensor Network. This network is synthetically generated and contains two types of objects, temperature (T) and precipitation (P) sensors, and four link types between any two types of sensors denoting the kNN relationship: $\langle T, T \rangle$, $\langle T, P \rangle$, $\langle P, T \rangle$, and $\langle P, P \rangle$. The links are binary weighted according to their $k$-nearest neighbors. The attributes associated with a sensor correspond to either temperature or precipitation, depending on the type of the sensor.

The weather sensor network is generated by assuming there are $K$ weather patterns, each of which is defined as a Gaussian distribution over temperature and precipitation attributes with different parameters. The links are built according to the $k$-nearest neighbors relationship. The temperature and precipitation observations are generated by sampling. The details of the sensor network generator are given in Appendix C. We use the weather network generator to generate two sets of synthetic climate sensor networks, each containing 4 clusters, where each sensor is linked to its 5 nearest neighbors of each type (10 in total). The first set of networks has attribute means (1, 1), (2, 2), (3, 3), (4, 4) for the four clusters, and the standard deviation of both attributes is set to 0.2. The correlation between temperature and precipitation is 0. The second set of networks has attribute means (1, 1), (−1, 1), (−1, −1), (1, −1) for the four clusters, with the same covariance matrix as in the first setting. Notice that Setting 2 is more difficult than Setting 1, in the sense that the weather pattern can only be determined when we know both the temperature and the precipitation observations for each location. The temperature sensors have soft cluster membership in two neighboring clusters (less noisy), while precipitation sensors have soft membership in three neighboring clusters (more noisy). In each setting, we vary the number of sensors by fixing the number of temperature sensors at 1000 and setting the number of precipitation sensors to 250, 500, and 1000. For each setting, the number of observations for each object may be 1, 5 or 20. In all, for each weather pattern setting, we have 9 networks with different configurations.
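As a rough illustration of how observations in the two settings could be sampled, the sketch below draws a cluster from a sensor's soft membership vector and then samples the sensor's attribute from that cluster's Gaussian. The sampling scheme is an assumption made for this example (the actual generator is described in Appendix C), and all names are hypothetical; only the means, the standard deviation, and the network sizes come from the text.

```python
import random

# Cluster means (temperature, precipitation) for the two settings.
SETTING_1_MEANS = [(1, 1), (2, 2), (3, 3), (4, 4)]
SETTING_2_MEANS = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
STD = 0.2  # same standard deviation for both attributes, correlation 0

def sample_observations(membership, means, attribute_index, num_obs, rng):
    """Sample observations of one attribute for one sensor.

    membership: soft cluster membership of the sensor (sums to 1).
    attribute_index: 0 for temperature, 1 for precipitation.
    Assumed scheme: pick a cluster by membership, then sample its Gaussian.
    """
    observations = []
    for _ in range(num_obs):
        k = rng.choices(range(len(means)), weights=membership)[0]
        observations.append(rng.gauss(means[k][attribute_index], STD))
    return observations

# The 9 configurations per setting: (#T sensors, #P sensors, #observations).
CONFIGURATIONS = [(1000, num_p, num_obs)
                  for num_p in (250, 500, 1000)
                  for num_obs in (1, 5, 20)]
```

In Setting 2 a single attribute cannot separate all four clusters, since each temperature mean and each precipitation mean is shared by two clusters.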

5.2 Effectiveness Study 

We use two measures for our effectiveness study. First, the labels associated with the nodes in the data sets provide natural guidance for examining the coherence of the clusters. We use Normalized Mutual Information (NMI) [21], which evaluates the similarity between two partitions of the objects, to compare our clustering result with the ground truth. Second, we use link prediction accuracy to test the clustering accuracy. The similarity between two objects can be calculated by a similarity function defined on their two membership vectors, such as cosine similarity. Clearly, better clustering quality leads to a better computation of similarity (and therefore to better link prediction accuracy). For a certain type of relation $\langle A, B \rangle$, we calculate the similarity scores between each $v_A \in A$ and all the objects $v_B \in B$, and compare the similarity-based ranked list with the true ranked list determined by the link weights between them. We use Mean Average Precision (MAP) [27] to compare the two ranked lists.
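Both measures can be sketched in a few lines of standard-library Python. Two assumptions are made here and should be checked against [21] and [27]: NMI is normalized by the geometric mean of the two entropies, and average precision treats relevance as binary (an object is relevant if it is truly linked), which simplifies the weighted ranked list described above. All function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(labels_a, labels_b):
    """Mutual information normalized by sqrt(H(a) * H(b)) (assumed variant)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in joint.items())
    denom = math.sqrt(entropy(labels_a) * entropy(labels_b))
    return mutual_info / denom if denom > 0 else 0.0

def average_precision(ranked_items, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevant_sets):
    """Mean of the average precision over all query objects."""
    scores = [average_precision(r, s) for r, s in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores)
```

NMI is invariant to a relabeling of the clusters, which is why it suits comparisons against ground-truth areas whose cluster indices are arbitrary.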

5.2.1 Clustering Accuracy Test 

We choose clustering methods that can deal with both links and attributes as our baselines. None of these baselines can weigh different link types by their differential impact on the clustering process; therefore, we set the strength of each link type to 1 for these baselines. In addition, we choose different baselines for clustering networks with text attributes and for clustering networks with numerical attributes, since there is no unified clustering method (other than the presented GenClus) that can address both situations in the same framework.

For the DBLP Four-area AC Network and the DBLP Four-area ACP Network, which have text attributes, we use NetPLSA [18] and iTopicModel [22] as baselines; both aim at improving topic quality by using link information in homogeneous networks. We compare GenClus with these baselines by assuming homogeneity of links for the latter two algorithms. The number of iterations of GenClus is set to 10. Each algorithm is run 20 times with random initial settings. The mean and standard deviation of NMI over the 20 runs are shown for the DBLP AC Network and the DBLP ACP Network in Figs. 5 and 6, respectively. From the results, we can see that GenClus is much more effective than iTopicModel and NetPLSA in both networks, due to the ability of GenClus to learn and leverage the strengths of different link types in the clustering process. Furthermore, the standard deviation of NMI over different runs is much lower for GenClus, which suggests that the algorithm is more robust to the initial settings thanks to the learned strength weights for different link types.

Figure 5: Clustering Accuracy Comparisons for AC Network

Figure 6: Clustering Accuracy Comparisons for ACP Network

The AC Network is the easiest case among the three networks, since it contains only one type of attribute (the text attribute), and all object types contain this attribute; that is, the attribute is complete for every object. The ACP network is a more difficult case, because not every object type contains the text attribute. This requires the clustering algorithm to be robust to objects with no attributes. From the results, we can see that GenClus is more robust than NetPLSA, which outputs almost random predictions for authors in the ACP network. Although iTopicModel performs better for objects of type C in the ACP network (see Fig. 6), GenClus still has a better overall performance. This is because our objective function is defined over all the object types rather than on a particular type.

We also examined the actual clusters obtained by the algorithm 
on the DBLP AC network, and list the corresponding cluster mem- 
berships for several well-known conferences and authors in Table 
1, where the research area names are given afterwards according 
to the clustering results. We can see that the clustering results of 
GenClus are consistent with human intuition. 



Object                DB       DM       IR       ML
------------------------------------------------------
SIGMOD                0.8577   0.0492   0.0482   0.0449
KDD                   0.0786   0.6976   0.1212   0.1026
CIKM                  0.2831   0.1370   0.4827   0.0971
Jennifer Widom        0.7396   0.0830   0.1061   0.0713
Jim Gray              0.8359   0.0656   0.0536   0.0449
Christos Faloutsos    0.4268   0.3055   0.1380   0.1296

Table 1: Case Studies of Cluster Membership Results

The synthetic weather sensor network is the most difficult case among the three networks, as it has two types of attributes corresponding to different types of sensors. Furthermore, all sensor nodes contain incomplete attributes. Existing algorithms cannot address these issues well. We compare the clustering results of GenClus with two baselines, by comparing the cluster labels with maximum probabilities against the ground truth. In this case, we choose as the initial seed for GenClus the tentative run with the highest objective function value, and the iteration number is set to 5. The first baseline is the $k$-means algorithm, and the second is a spectral clustering method that combines the network structure and attribute similarity into a new similarity matrix. We use the framework given in [20], which utilizes the modularity objective function in the network part, but we replace the cosine similarity by Euclidean distance in the attribute part as in [26] for better clustering results. As neither method can handle the problem of incomplete attributes, we use interpolation to give each sensor a regular 2-dimensional attribute, by using the mean of all the observations of its neighbors and itself. For the spectral clustering-based framework, we center the data by subtracting the mean and then normalize by the standard deviation, in order to make the attribute part comparable with the modularity part in the objective function. Both parts are set to have equal weights.

Figure 7: Clustering Accuracy Comparisons for Setting 1

Figure 8: Clustering Accuracy Comparisons for Setting 2
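The preprocessing applied to the baselines can be sketched as follows. The data layout (a dictionary of per-attribute observation lists and a neighbor dictionary), the function names, and the use of the population standard deviation are assumptions for illustration; the sketch also assumes every sensor can reach at least one observation of each attribute through itself or its neighbors.

```python
import math

def interpolate_features(sensor, observations, neighbors, attributes=("T", "P")):
    """Give one sensor a value for every attribute.

    Each value is the mean of all observations of that attribute held by
    the sensor itself and by its neighbors.
    """
    features = []
    for attr in attributes:
        pooled = [x
                  for s in [sensor] + list(neighbors[sensor])
                  for x in observations[s].get(attr, [])]
        features.append(sum(pooled) / len(pooled))
    return features

def standardize(rows):
    """Subtract the column mean and divide by the column standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    # A constant column is mapped to zeros instead of dividing by zero.
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]
```

Because the interpolated value pools a sensor's own observations with those of its neighbors, it is a smoothed estimate rather than the sensor's own mean.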

The results are summarized in Figs. 7 and 8. GenClus outperforms the two baselines on most of the data sets (17 out of 18 cases). Furthermore, GenClus produces more stable clustering results than $k$-means, which is very sensitive to the number of observations for each object, especially in Setting 2. GenClus is also highly adaptive: it needs no weight specification for combining the network and attribute contributions to the clustering process. This results in greater stability for the GenClus algorithm. Another major advantage of GenClus (which is not immediately evident from the presented results) is that it can directly utilize every observation instead of the mean, whereas the baselines can only use a biased mean value because of the interpolation process.

5.2.2 Link Prediction Accuracy Test 

Next, the link prediction accuracy measured by MAP is compared between GenClus and the baselines. For the AC network, we select the link type $\langle A, C \rangle$ for prediction; that is, we want to predict the conferences in which an author is likely to publish. For the ACP network, we select the link type $\langle P, C \rangle$ for prediction; that is, we want to predict the conference in which a paper is published. As the prediction is based on the similarity between two objects, say query object $v_i$ with cluster membership $\theta_i$ and candidate object $v_j$ with cluster membership $\theta_j$, three similarity functions are used here: (1) cosine similarity, denoted as $\cos(\theta_i, \theta_j)$; (2) the negative of the Euclidean distance, denoted as $-\|\theta_i - \theta_j\|$; and (3) the negative of the cross entropy, denoted as $-H(\theta_j, \theta_i)$. The results are summarized in Tables 2 and 3.
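The three similarity functions can be written directly on membership vectors, as in the sketch below. It assumes the usual cross entropy definition $H(\theta_j, \theta_i) = -\sum_k \theta_{j,k} \log \theta_{i,k}$; the function names and the small floor `eps`, which only avoids taking the logarithm of zero, are additions for this example.

```python
import math

def cosine_similarity(theta_i, theta_j):
    """Cosine of the angle between two membership vectors."""
    dot = sum(a * b for a, b in zip(theta_i, theta_j))
    norm_i = math.sqrt(sum(a * a for a in theta_i))
    norm_j = math.sqrt(sum(b * b for b in theta_j))
    return dot / (norm_i * norm_j)

def neg_euclidean(theta_i, theta_j):
    """Negative Euclidean distance, so that larger means more similar."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_i, theta_j)))

def neg_cross_entropy(theta_i, theta_j, eps=1e-12):
    """Negative cross entropy: sum over k of theta_j[k] * log(theta_i[k])."""
    return sum(b * math.log(max(a, eps)) for a, b in zip(theta_i, theta_j))
```

Unlike the first two functions, the negative cross entropy is asymmetric: swapping the query and the candidate generally changes the score.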
                               NetPLSA   iTopicModel   GenClus
----------------------------------------------------------------
$\cos(\theta_i, \theta_j)$     0.4351    0.5117        0.7627
$-\|\theta_i - \theta_j\|$     0.4312    0.5010        0.7539
$-H(\theta_j, \theta_i)$       0.4323    0.5088        0.7753

Table 2: Prediction Accuracy for A-C Relation in AC Network





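                               NetPLSA   iTopicModel   GenClus
----------------------------------------------------------------
$\cos(\theta_i, \theta_j)$     0.2762    0.4609        0.5170
$-\|\theta_i - \theta_j\|$     0.2759    0.4600        0.5142
$-H(\theta_j, \theta_i)$       0.2760    0.4683        0.5183

Table 3: Prediction Accuracy for P-C Relation in ACP Network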

For the weather sensor network, we select the link type (T, P), 
namely, we want to predict the P-typed neighbors for the T-typed 
sensors. We test the link prediction in the network with configura- 
tion as in Setting 1, with #T = 1000 and #P = 250. We only 
output the link prediction results for the GenClus algorithm, since 
the other two baselines can only output hard clusters (exact cluster 
memberships rather than probabilities). The results are shown in 
Table 4. 

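       $\cos(\theta_i, \theta_j)$   $-\|\theta_i - \theta_j\|$   $-H(\theta_j, \theta_i)$
--------------------------------------------------------------------------------------------
MAP    0.7285                       0.7690                       0.8073

Table 4: Prediction Accuracy for $\langle T, P \rangle$ in Weather Network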

From the results, it is evident that GenClus has the best link prediction accuracy under all three similarity functions. Also, the results show that the asymmetric function $-H(\theta_j, \theta_i)$ provides the best link prediction accuracy, especially for better clustering results such as those obtained by GenClus, and in the weather sensor network, where the out-link neighbors are different from the in-link neighbors.

5.2.3 Analysis of Link Type Strength 

Since learning the semantic importance of relations is central to a heterogeneous clustering approach, we present the learned relation strengths in Fig. 9 for the two DBLP four-area networks. From the figure, it is evident that in the AC Network, the link type $\langle A, C \rangle$ has greater importance to the clustering process than the link type $\langle A, A \rangle$, and thus is more important in deciding an author's membership. This is because the spectrum of co-authors may often be quite broad, whereas an author's publication frequency in each conference can be a more reliable predictor of clustering behavior. For the ACP Network, we can see that the link type $\langle P, C \rangle$ has the weight 3.13, whereas the link type $\langle P, A \rangle$ has a much higher weight of 13.30. This suggests that the latter link type is more reliable in deciding the cluster for papers, since a conference usually covers a broader spectrum than an author. For example, it is difficult to judge the cluster for a paper if we only know that it is published in the CIKM conference. The ability of our algorithm to learn such important characteristics of different link types is one of the reasons that it is superior to other competing methods.

For the weather sensor network, we summarize in Table 5 the link type strengths for the three networks of different sizes that contain 5 observations for each sensor, using the configuration of Setting 1. It is evident that GenClus correctly detects that: (1) the P-typed sensors cannot be trusted as much as the other ones when P-typed sensors are very sparse, due to their farther distance and lower similarity to other objects (the strengths of the $\langle T, P \rangle$ and $\langle P, P \rangle$ relations decrease as #P decreases); and (2) for both types of sensors, T-typed neighbors are more trustworthy than P-typed ones, due to the higher quality of T-typed data in the network setting.

Figure 9: Strength for Link Types in Two Four-area Networks. (a) AC Network; (b) ACP Network.





                   (T, T)   (T, P)   (P, T)   (P, P)
------------------------------------------------------
T:1000; P:250      3.14     2.88     1.60     1.32
T:1000; P:500      3.16     3.05     2.38     1.98
T:1000; P:1000     3.14     3.03     3.34     2.78

Table 5: Link Type Strength for Weather Sensor Network in Setting 1

5.3 A Typical Running Case 

One of the core ideas of this paper is to enable a mutual learning process between the importance of link types for clustering and the actual clustering results. In this section, we provide some detailed results at different iterations of the algorithm, which suggest that such a mutual learning process does occur. In particular, a typical running case for the AC Network is illustrated in Fig. 10. Fig. 10(a) shows how the clustering accuracy progresses along with the changes in the importance of different link types. Fig. 10(b) shows how the strength weights change along with the clustering results at different iterations and finally converge to the correct values. Note that we plot the initial value of $\gamma$, which is an all-one vector, at iteration 0 in Fig. 10(b).




Figure 10: A Running Case on AC Network: Iterations 1 to 10. (a) Clustering Accuracy; (b) Strength of Link Types. The horizontal axis of both panels is the Number of Iterations.

5.4 Efficiency Study 

To examine the efficiency of our algorithm, we illustrate the ex- 
ecution time of each inner iteration for the EM algorithm, which 
is the bottleneck component for the overall time complexity. The 
results are presented for the weather sensor network with different 
sizes and different numbers of observations for both pattern genera- 
tor settings. The results are illustrated in Fig. 11, and are consistent 
with our observations in the complexity section about the scalabil- 
ity with the number of objects. 

One observation is that the EM approach, which is the major component of GenClus, is very easy to parallelize. We tested a parallel version of the EM algorithm using 4 parallel threads (each running on a 2.13 GHz processor), and the execution time improved by a factor of 3.19. This suggests that the approach is highly parallelizable.

Figure 11: Scalability Test over Number of Objects. (a) Pattern Setting 1; (b) Pattern Setting 2. The horizontal axis of both panels is the Number of Objects.

more similar to the third line, that is, two objects linking together indicates a higher chance that they have similar cluster memberships. Moreover, we further associate each type of links with a different importance weight in measuring the consistency under a given clustering purpose, and thus each type of relation carries different strengths in passing the cluster membership between the linked



6. RELATED WORK 

Clustering is a classical problem in data analysis, and has been studied extensively in the context of multi-dimensional data [13]. Most of these algorithms are attribute based, in which the data corresponds to a multi-dimensional format and does not contain links. A number of clustering methods [5, 14, 6, 7] have been proposed on the basis of network structure only, mainly in the context of the community detection problem [2, 15, 8]. A recent piece of work extends the network clustering problem to the heterogeneous scenario [23]. However, this latter method [23] is designed for a specific kind of network structure, referred to as the star network schema, and is not applicable to networks of general structure. Furthermore, it cannot be easily integrated with attribute information.

Recently, some studies [3, 20, 18, 25] have shown that by con- 
sidering the link constraints in addition to the attributes, the cluster- 
ing accuracy can be enhanced. However, most of these algorithms 
require that the network links, objects and their attributes are all 
homogeneous. A recent clustering method [28] integrates the net- 
work clustering process with categorical attributes by considering 
the latter as augmented objects, but the same methodology cannot 
be applied to numerical values. Some other algorithms [20] can 
cluster objects with numerical attributes by combining the network 
clustering objective function with a numerical clustering objective 
function, but it is difficult to decide the weight to combine them, 
and cannot deal with the incomplete attributes properly. [16] pro- 
vides a framework for clustering objects in relational networks with 
attributes. However, they studied a different clustering problem by 
clustering objects from different types separately, and did not study 
the interplay of importance of different link types and the clustering 
results. Probabilistic relational models, such as [24], provide a way 
to model a rational database containing both attributes and links, 
but do not consider the scenario studied in this paper that cluster- 
ing purposes could be different according to the specified attributes. 
Also, they cannot handle the problem of incomplete attributes due 
to the discriminative nature of their methods. 

There are several different philosophies on using the link infor- 
mation in addition to attributes to help the clustering in networks. 
First, in [20, 28], links are viewed to provide another angle of sim- 
ilarity measure between objects besides the attribute-based similar- 
ity measure, and the final clustering results are generated by com- 
bining the two angles. Second, In relational clustering [16] and 
probabilistic relational models [24], every link is treated as equally 
important and the probability of a link appearance is modeled ex- 
plicitly according to the cluster memberships of the two objects of 
the link, in a way of building mixture of block models [1]. Third, 
in [18, 22], links are considered to provide additional information 
about the similarity between objects that are consistent with the at- 
tributes, and the final clustering result is a more smoothing version 
compared with the one merely using attributes. However, none of 
these views is able to model the fact that different relations should 
have different importance in determining the clustering process for 
a certain purpose. Our philosophy in modeling link consistency is 



7. CONCLUSIONS 

We propose GenClus, the first approach to cluster general hetero- 
geneous information networks with different link types and differ- 
ent attribute types, such as numerical or text attributes, with guid- 
ance from a specified subset of the attributes. Our algorithm is 
designed to seamlessly work in the case when some of the nodes 
may not have the complete attribute information. One key observa- 
tion of the work is that heterogeneous network clustering provides a 
tremendous challenge because different types of links may present 
different levels of semantic importance to the clustering process. 
The importance of different semantic link types is learned in order 
to enable an effective clustering algorithm that meets a user's de- 
mand. We present experimental results which show the advantages 
of the approach over competing methods, including a number of 
interesting case studies and a study of the algorithm efficiency. 

8. REFERENCES 

[1] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed 
    membership stochastic blockmodels. J. Mach. Learn. Res., 
    9:1981-2014, June 2008. 
[2] L. Backstrom, D. P. Huttenlocher, J. M. Kleinberg, and X. Lan. 
    Group formation in large social networks: membership, growth, and 
    evolution. In KDD, pages 44-54, 2006. 
[3] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework 
    for semi-supervised clustering. In KDD, pages 59-68, 2004. 
[4] J. Bilmes. A Gentle Tutorial on the EM Algorithm and its 
    Application to Parameter Estimation for Gaussian Mixture and 
    Hidden Markov Models. University of Berkeley Tech Rep ICSITR, 
    (ICSI-TR-97-021), 1997. 
[5] D. Bortner and J. Han. Progressive clustering of networks using 
    structure-connected order of traversal. In ICDE, pages 653-656, 2010. 
[6] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering. 
    In KDD, pages 554-560, 2006. 
[7] Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng. Evolutionary 
    spectral clustering by incorporating temporal smoothness. In KDD, 
    pages 153-162, 2007. 
[8] A. Clauset, M. E. J. Newman, and C. Moore. Finding community 
    structure in very large networks. In Phys. Rev. E 70, 066111, 2004. 
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood 
    from incomplete data via the EM algorithm. Journal of the Royal 
    Statistical Society, Series B, 39(1):1-38, 1977. 
[10] J. Gao, W. Fan, Y. Sun, and J. Han. Heterogeneous source consensus 
    learning via decision propagation and negotiation. In KDD, 2009. 
[11] T. Hofmann. Probabilistic latent semantic analysis. In UAI, 1999. 
[12] X. Hu and L. Xu. Investigation on several model selection criteria for 
    determining the number of cluster. Neural Inform. Proces. - Lett. and 
    Reviews, 4:1-10, 2004. 
[13] A. K. Jain and R. C. Dubes. Algorithms for clustering data. 
    Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988. 
[14] M.-S. Kim and J. Han. A particle-and-density based evolutionary 
    clustering method for dynamic networks. PVLDB, 2(1):622-633, 
    2009. 
[15] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. 
    Statistical properties of community structure in large social and 
    information networks. In WWW, 2008. 
[16] B. Long, Z. M. Zhang, and P. S. Yu. A probabilistic framework for 
    relational clustering. In KDD, pages 470-479, 2007. 
[17] U. Luxburg. A tutorial on spectral clustering. Statistics and 
    Computing, 17:395-416, December 2007. 



[18] Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network 
    regularization. In WWW, 2008. 
[19] G. Milligan and M. Cooper. An examination of procedures for 
    determining the number of clusters in a data set. Psychometrika, 
    50(2):159-179, June 1985. 
[20] M. Shiga, I. Takigawa, and H. Mamitsuka. A spectral clustering 
    approach to optimally combining numerical vectors with a modular 
    network. In KDD, pages 647-656, 2007. 
[21] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse 
    framework for combining multiple partitions. J. Mach. Learn. Res., 
    3:583-617, March 2003. 
[22] Y. Sun, J. Han, J. Gao, and Y. Yu. iTopicModel: Information 
    network-integrated topic modeling. In ICDM, pages 493-502, 2009. 
[23] Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous 
    information networks with star network schema. In KDD, 2009. 
[24] B. Taskar, E. Segal, and D. Koller. Probabilistic classification and 
    clustering in relational data. In IJCAI, pages 870-876, 2001. 
[25] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for 
    community detection: a discriminative approach. In KDD, 2009. 
[26] C. H. Zha, H. Zha, X. He, C. Ding, H. Simon, and M. Gu. Spectral 
    relaxation for k-means. In NIPS, 2001. 
[27] C. Zhai. Statistical Language Models for Information Retrieval. Now 
    Publishers Inc., Hanover, MA, USA, 2008. 
[28] Y. Zhou, H. Cheng, and J. X. Yu. Graph clustering based on 
    structural/attribute similarities. Proc. VLDB Endow., 2(1), 2009. 

APPENDIX 

A. EM ALGORITHM PROOF 

In the E-step of the $t$-th iteration, the Q function, namely the 
expected value of $g_1$ under the conditional distribution of the 
hidden variables $Z^t$ (the cluster labels), given the observations 
$\{v[X]\}$ and the current parameters $\Theta^{t-1}$, $\beta^{t-1}$, is: 

\[
\begin{aligned}
Q &= E_{Z^t \mid \{v[X]\}_{v \in V_X},\, \Theta^{t-1},\, \beta^{t-1}}
     \big( g_1(\Theta, \beta, Z^t) \big) \\
  &= \sum_{Z^t} p\big(Z^t \mid \{v[X]\}_{v \in V_X}, \Theta^{t-1}, \beta^{t-1}\big)\,
     g_1(\Theta, \beta, Z^t)
\end{aligned}
\]

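where the link feature function $f$ and the mixture model function in 
$g_1(\Theta, \beta, Z^t)$, the complete likelihood function of 
$g_1(\Theta, \beta)$, can be expanded by substituting Eqs. (6) and (3) 
into Eq. (9): 

\[
g_1(\Theta, \beta, Z^t)
  = \sum_{e = \langle v, u \rangle} \gamma(\phi(e))\, w(e)
      \sum_{k=1}^{K} \theta_{u,k} \log \theta_{v,k}
  + \sum_{v \in V_X} \sum_{l=1}^{m} c_{v,l}
      \log\big( \theta_{v, z_{v,l}}\, \beta_{z_{v,l},\, l} \big)
\]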

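Since the feature function $f$ (contained in the first part of $g_1$) 
does not involve the observations of attributes and thus contains no 
hidden cluster label for each observation, the conditional expectation 
of $f$ under $Z^t$ is just $f$ itself. Therefore, the Q function is: 

\[
\begin{aligned}
Q = {} & \sum_{e = \langle v, u \rangle} \gamma(\phi(e))\, w(e)
         \sum_{k=1}^{K} \theta_{u,k} \log \theta_{v,k} \\
       & + \sum_{v \in V_X} \sum_{k=1}^{K} \sum_{l=1}^{m} c_{v,l}
         \log\big( \theta_{v,k}\, \beta_{k,l} \big)\,
         p\big(z_{v,l} = k \mid \Theta^{t-1}, \beta^{t-1}\big)
\end{aligned}
\]

where the conditional probability of the hidden cluster label for 
object $v$ can be evaluated by 
$p(z_{v,l} = k \mid \Theta^{t-1}, \beta^{t-1}) \propto \theta_{v,k}^{t-1}\, \beta_{k,l}^{t-1}$. 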

In the M-step, new values for the parameters $\Theta^t$ and $\beta^t$ 
are obtained by maximizing the Q function with the help of Lagrange 
multipliers. First, the parameter $\theta_v^t$ for each object $v$ is 
maximized by fixing the other parameters at their values from 
step $t-1$, namely $\{\theta_u^{t-1}\}_{u \neq v}$ and $\beta^{t-1}$, 
with the following updating rule for $k = 1$ to $K$: 

\[
\theta_{v,k}^{t} \propto
  \sum_{e = \langle v, u \rangle} \gamma(\phi(e))\, w(e)\, \theta_{u,k}^{t-1}
  + \mathbf{1}_{\{v \in V_X\}} \sum_{l=1}^{m} c_{v,l}\,
    p\big(z_{v,l} = k \mid \Theta^{t-1}, \beta^{t-1}\big)
\]

where $\mathbf{1}_{\{v \in V_X\}}$ is the indicator function, which 
equals 1 if $v$ contains the attribute $X$ and 0 otherwise. 

Then the parameter $\beta_k^t$ is evaluated by fixing 
$\Theta = \Theta^t$ for each cluster $k$, using the following updating 
rule for $l = 1$ to $m$: 

\[
\beta_{k,l}^{t} \propto \sum_{v \in V_X} c_{v,l}\,
  p\big(z_{v,l} = k \mid \Theta^{t-1}, \beta^{t-1}\big).
\]
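
The two updating rules can be illustrated with a minimal sketch of one EM iteration for a single text attribute. This is not the authors' implementation: the function names (`normalize`, `em_step`), the data layout (dictionaries of membership vectors, a list of `(source, target, relation, weight)` links) and the toy values are assumptions made for illustration, and the link strengths `gamma` are treated as fixed inputs.

```python
def normalize(xs):
    """Scale non-negative numbers to sum to 1 (uniform if they sum to 0)."""
    total = sum(xs)
    if total <= 0:
        return [1.0 / len(xs)] * len(xs)
    return [x / total for x in xs]


def em_step(theta, beta, counts, links, gamma):
    """One EM iteration (hypothetical layout, see lead-in).

    theta:  {object: [K membership weights]}
    beta:   K rows, each a distribution over m terms
    counts: {object: [m term counts]} for objects that carry the attribute
    links:  list of (v, u, relation, weight) out-links e = <v, u>
    gamma:  {relation: strength}, held fixed here
    """
    K, m = len(beta), len(beta[0])

    # E-step: p(z_{v,l} = k) is proportional to theta_{v,k} * beta_{k,l}.
    post = {
        v: [normalize([theta[v][k] * beta[k][l] for k in range(K)])
            for l in range(m)]
        for v in counts
    }

    # M-step for theta: link term plus attribute term, then normalize.
    new_theta = {}
    for v in theta:
        acc = [0.0] * K
        for src, dst, rel, w in links:
            if src == v:
                for k in range(K):
                    acc[k] += gamma[rel] * w * theta[dst][k]
        if v in counts:  # indicator 1{v in V_X}
            for l in range(m):
                for k in range(K):
                    acc[k] += counts[v][l] * post[v][l][k]
        # Assumption: an object with no out-links and no attribute
        # keeps its previous membership.
        new_theta[v] = normalize(acc) if sum(acc) > 0 else list(theta[v])

    # M-step for beta: expected term counts per cluster, then normalize.
    new_beta = [
        normalize([sum(counts[v][l] * post[v][l][k] for v in counts)
                   for l in range(m)])
        for k in range(K)
    ]
    return new_theta, new_beta


# Toy network: "a" and "b" carry text, "c" has no attribute but links to "a".
theta = {"a": [0.6, 0.4], "b": [0.4, 0.6], "c": [0.5, 0.5]}
beta = [[0.6, 0.4], [0.4, 0.6]]
counts = {"a": [4, 0], "b": [0, 4]}
links = [("c", "a", "r", 1.0)]
gamma = {"r": 1.0}
for _ in range(10):
    theta, beta = em_step(theta, beta, counts, links, gamma)
```

In this toy run the attribute-free object "c" obtains its membership only through its link to "a", which is the behaviour the link term in the update for $\theta_v^t$ is meant to provide.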

B. CONCAVITY PROOF 

THEOREM 1. $g_2'(\gamma)$ defined in Eq. (14) is a concave function. 

PROOF. To show that $g_2'$ is a concave function, we only need to show 
that its Hessian $H(g_2')(\gamma)$ is a negative definite matrix, the 
$(i, j)$ element of which is 

\[
\frac{\partial^2 g_2'(\gamma)}{\partial \gamma(r_i)\, \partial \gamma(r_j)}
  = - \sum_{v} \frac{\partial^2 \log Z_v(\gamma)}{\partial \gamma(r_i)\, \partial \gamma(r_j)}
    - \frac{1}{\sigma^2}\, \mathbf{1}_{\{r_i = r_j\}}
\]

where $Z_v(\gamma)$ is the normalization function for 
$p(\theta_v \mid \theta_{-v})$. Since each conditional distribution for 
$\theta_v$ belongs to the exponential family with parameters $\gamma$, 
we have 
$\frac{\partial^2 \log Z_v(\gamma)}{\partial \gamma(r_i)\, \partial \gamma(r_j)} = \mathrm{cov}_v(\gamma(r_i), \gamma(r_j))$, 
which is the covariance between $\gamma(r_i)$ and $\gamma(r_j)$. In all, 

\[
H(g_2')(\gamma) = - \sum_{v} \mathrm{cov}_v - \frac{1}{\sigma^2}\, I .
\]

Since for each object $v$ the corresponding covariance matrix 
$\mathrm{cov}_v$ is positive semidefinite, and the diagonal matrix 
$\frac{1}{\sigma^2} I$ is positive definite, their linear combination 
with negative weights is negative definite. □ 
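
The last step of the proof can be spelled out through the quadratic form. The short derivation below is a sketch that assumes the Hessian decomposes as $H(g_2')(\gamma) = -\sum_v \mathrm{cov}_v - \frac{1}{\sigma^2} I$, with each $\mathrm{cov}_v$ positive semidefinite and $\sigma^2 > 0$; the vector $x$ is introduced here only for illustration.

```latex
\[
x^\top H(g_2')(\gamma)\, x
  = - \sum_{v} x^\top \mathrm{cov}_v\, x \;-\; \frac{1}{\sigma^2}\, \lVert x \rVert^2
  \;\le\; - \frac{1}{\sigma^2}\, \lVert x \rVert^2
  \;<\; 0
  \qquad \text{for all } x \neq 0 .
\]
```

The inequality uses $x^\top \mathrm{cov}_v\, x \ge 0$ for every $v$, so strict negativity comes entirely from the prior term $\frac{1}{\sigma^2} I$.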

C. SYNTHETIC WEATHER NETWORK GENERATOR 

We now describe the weather sensor network generator. Assume there 
are K weather patterns, each defined as a Gaussian distribution over 
the temperature and precipitation attributes with different 
parameters. A weather sensor network is built by considering the 
sensors as the objects in the network, the links as denoting the 
k-nearest-neighbor relationship, and temperature and precipitation as 
the attributes. Each sensor is a mixture model of different 
weather patterns, and nearby sensors have similar pattern 
coefficients. Each sensor may have multiple observations, obtained at 
different times. The following steps and input parameters are 
required to generate the weather sensor network: 

• Network size. The number of temperature sensors is denoted 
by #T, the number of precipitation sensors by #P, and the 
number of nearest neighbors required for link construction by 
k. These are input parameters to the generation process. 

• Network structure. For each sensor, we randomly assign its 
location within a unit circle around the central point. An out-link 
exists from sensor i to sensor j if j is one of the k nearest 
neighbors of i among the sensors of j's type. 

• Weather pattern. Let K be the number of clusters (weather 
patterns). Each such pattern is specified with a mean and co- 
variance matrix over temperature and precipitation. The cir- 
cle is then partitioned equally into K rings, on the basis of 
distance from the central point. 

• Cluster membership. The cluster membership of each sensor 
is determined by the reciprocal of its distance to the center 
of each weather region. 

• Attribute observations. The number of observations is regulated 
by the user-specified input parameter #obs. The attribute values 
at each sensor are generated according to the mixture model with 
the coefficients specified in its cluster membership. 
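
The steps above can be sketched as a small generator. This is an illustrative reading of the description, not the authors' code. Several details are assumptions labeled as such: rings have equal width, the "center" of a weather region is taken to be the mid-radius of its ring, temperature and precipitation are drawn independently (a diagonal covariance), each sensor observes only the attribute of its own type, and the names `generate_weather_network`, `num_t`, `num_p` and `num_obs` are hypothetical stand-ins for #T, #P and #obs.

```python
import math
import random


def generate_weather_network(num_t, num_p, k, patterns, num_obs, seed=0):
    """Sketch of the generator.

    patterns: list of K entries ((mean_temp, std_temp), (mean_prec, std_prec)).
    Returns (sensors, links); links are (source id, target id, relation).
    """
    rng = random.Random(seed)
    K = len(patterns)

    # Network size and locations: uniform points inside the unit circle.
    sensors = []
    for i in range(num_t + num_p):
        r = math.sqrt(rng.random())
        angle = rng.uniform(0.0, 2.0 * math.pi)
        sensors.append({
            "id": i,
            "type": "T" if i < num_t else "P",
            "x": r * math.cos(angle),
            "y": r * math.sin(angle),
            "r": r,
        })

    # Network structure: out-links to the k nearest sensors of each type.
    links = []
    for s in sensors:
        for kind in ("T", "P"):
            others = [o for o in sensors
                      if o["type"] == kind and o["id"] != s["id"]]
            others.sort(key=lambda o: math.hypot(o["x"] - s["x"],
                                                 o["y"] - s["y"]))
            for o in others[:k]:
                links.append((s["id"], o["id"], s["type"] + kind))

    # Cluster membership: reciprocal distance to each ring's mid-radius
    # (assumed meaning of "center"), normalized to sum to 1.
    for s in sensors:
        inv = [1.0 / (abs(s["r"] - (j + 0.5) / K) + 1e-6) for j in range(K)]
        total = sum(inv)
        s["theta"] = [w / total for w in inv]

    # Attribute observations: draw a pattern from the membership, then a
    # Gaussian value for the attribute matching the sensor's type.
    for s in sensors:
        attr = 0 if s["type"] == "T" else 1
        s["obs"] = []
        for _ in range(num_obs):
            z = rng.choices(range(K), weights=s["theta"])[0]
            mean, std = patterns[z][attr]
            s["obs"].append(rng.gauss(mean, std))

    return sensors, links


# Example with two hypothetical weather patterns.
patterns = [((10.0, 1.0), (5.0, 1.0)), ((20.0, 1.0), (1.0, 0.5))]
sensors, links = generate_weather_network(5, 4, 2, patterns, num_obs=3, seed=1)
```

Sorting all candidates for every sensor keeps the sketch short but costs quadratic time in the number of sensors; a spatial index would be the natural replacement at the network sizes used in the scalability test.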
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